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http://github.com/nst/UnicodePoster 






Unicode does not address character rendering



glyphs 




text rendering engine 
NSLayoutManager 




codepoints 
U+2603 SNOWMAN 



binary representation 
E2 98 83 (UTF-8) 



fonts 

Times New Roman.ttf

TrueType and OpenType fonts can contain up to 2^16 glyphs, i.e. 65,536.







[Figure: map of Unicode blocks by row (0x70 to 0xEE), drawn with the Apple Last Resort Font. Legible block labels include Bopomofo, CJK Strokes, Katakana, Enclosed CJK, CJK Ideographs, Yi Syllables, Yi Radicals, Latin Ext. D, Syloti Nagri, Phags-pa, Sharada, Takri, Cuneiform, Miao, Kana Supplement, Byzantine Musical Symbols, Musical Symbols, Ancient Greek Musical Notation, Math Alphanumeric Symbols, Arabic Math Alphabetic Symbols, Domino Tiles, Enclosed Alphanumeric Supplement, Enclosed Ideographic Supplement, Alchemical Symbols, CJK Ideographs Ext. B and CJK Compatibility Ideographs Supplement.]



Unicode Technical Reports



UTR (Unicode Technical Report) 

informative material 




UAX (Unicode Standard Annex) 
integral part of the standard 



UTS (Unicode Technical Standard) 
independent specification



http://www.unicode.org/reports/about-reports.html 











Unicode Character Database (UCD), TR#44 (UAX) 



http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt 




Field                                            Value                              Comment
0.  Codepoint                                    00E9
1.  Name                                         LATIN SMALL LETTER E WITH ACUTE
2.  General_Category                             Ll                                 a lowercase letter
3.  Canonical_Combining_Class                    0                                  not reordered
4.  Bidi_Class                                   L                                  left to right
5.  Decomposition_Type, Decomposition_Mapping    0065 0301
6.  Numeric_Type, Numeric_Value
7.  Numeric_Type, Numeric_Value
8.  Numeric_Type, Numeric_Value
9.  Bidi_Mirrored                                N                                  Y if mirrored in bidirectional text
10. Unicode_1_Name (Obsolete)                    LATIN SMALL LETTER E ACUTE         name in Unicode 1.0
11. ISO_Comment (Obsolete)
12. Simple_Uppercase_Mapping                     00C9
13. Simple_Lowercase_Mapping                                                        already lowercase
14. Simple_Titlecase_Mapping                     00C9
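As a rough illustration of this record layout, the sketch below splits the UnicodeData.txt entry for U+00E9 on semicolons and cross-checks a few fields against Python's unicodedata module. The record is pasted inline rather than downloaded, and FIELD_NAMES and parse_record are illustrative names chosen for this example, not part of any Unicode tooling.

```python
import unicodedata

# The UnicodeData.txt record for U+00E9, pasted inline for the example
LINE = ("00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;"
        "LATIN SMALL LETTER E ACUTE;;00C9;;00C9")

# Illustrative labels for the 15 semicolon-separated fields (0 to 14)
FIELD_NAMES = [
    "Codepoint", "Name", "General_Category", "Canonical_Combining_Class",
    "Bidi_Class", "Decomposition", "Numeric_6", "Numeric_7", "Numeric_8",
    "Bidi_Mirrored", "Unicode_1_Name", "ISO_Comment",
    "Simple_Uppercase_Mapping", "Simple_Lowercase_Mapping",
    "Simple_Titlecase_Mapping",
]


def parse_record(line):
    """Map each field label to its (possibly empty) string value."""
    return dict(zip(FIELD_NAMES, line.split(";")))


record = parse_record(LINE)
char = chr(int(record["Codepoint"], 16))

# The same properties are exposed by the standard library
print(record["Name"], "|", unicodedata.name(char))
print(record["General_Category"], "|", unicodedata.category(char))
print(record["Decomposition"], "|", unicodedata.decomposition(char))
```

An empty Simple_Lowercase_Mapping field means the character maps to itself, which is why field 13 is blank for a letter that is already lowercase.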






Unicode Technical Committee Minutes 



The Unicode Technical Committee (UTC) meets quarterly each year. Meeting minutes document the 
decisions, actions and voting record of the Full, Institutional, and Supporting Members of the Committee 
through numbered motions, consensus statements, and action items. Approved meeting minutes are ones 
that have been reviewed and approved by the UTC, preliminary minutes are ones posted for final public 
review prior to their approval at the next meeting of the UTC. 

In addition, draft minutes are available via the current document register. These are unapproved minutes
from the most recent UTC and are subject to revision before final versions are posted. 



UTC Minutes   Status     Location        Dates
UTC 143                  San Jose, CA    May 4-8, 2015
UTC 142                  San Jose, CA    Feb. 2-5, 2015
UTC 141                  Sunnyvale, CA   Oct. 27-31, 2014
UTC 140       Draft      Redmond, WA     August 5-8, 2014
UTC 139       Draft      San Jose, CA    May 6-9, 2014
UTC 138       Draft      San Jose, CA    Feb 3-6, 2014
UTC 137       Draft      Cupertino, CA   November 4-7, 2013
UTC 136       Approved   Redmond, WA     July 29 - August 2, 2013
UTC 135       Approved   San Jose, CA    May 6-10, 2013
UTC 134       Approved   San Jose, CA    Jan 28 - Feb 1, 2013
UTC 133       Approved   Cupertino, CA   November 5-9, 2012
UTC 132       Draft      Redmond, WA     July 30 - August 6, 2012
UTC 131       Approved   San Jose, CA    May 7-11, 2012



E.g. proposal to encode GREEK BYZANTINE DOUBLE SUSPENSION MARK



L2/14-157 


ISO/IEC JTC 1/SC 2/WG 2

PROPOSAL SUMMARY FORM TO ACCOMPANY SUBMISSIONS
FOR ADDITIONS TO THE REPERTOIRE OF ISO/IEC 10646
Please fill all the sections A, B and C below.

Please read Principles and Procedures Document (P & P) from http://www.dkuug.dk/JTC1/SC2/WG2/docs/principles.html for guidelines and details before filling this form.

Please ensure you are using the latest Form from http://www.dkuug.dk/JTC1/SC2/WG2/docs/summaryform.html.

See also http://www.dkuug.dk/JTC1/SC2/WG2/docs/roadmaps.html for latest Roadmaps.




A. Administrative 




1. Title: Proposal to encode GREEK BYZANTINE DOUBLE SUSPENSION MARK

2. Requester's name: Dumbarton Oaks 

3. Requester type (Member body/Liaison/Individual contribution): individual contribution 

4. Submission date: 2014-07-18 

5. Requester's reference (if applicable): 

6. Choose one of the following: 

This is a complete proposal: yes 

(or) More information will be provided later: 





Byzantine seals

DOSeals 2:40.11 (10th c.): In this example, the typesetter tried, but failed, to get the two characters to align vertically (they used two legacy characters with incompatible kerning values).

[Figure: scanned catalogue pages. The first shows entry 40.11, "Leo imperial protospatharios and ek prosopou of Aigaion Pelagos (X c.)", DO 58.106.2474, with the obverse and reverse inscriptions and their transcription. The second lists specimens DO 55.1.847 to DO 55.1.857, Fogg 1946 and Fogg 2312 for the seal of Euphemianos, and notes as an epigraphical oddity the presence of double abbreviation marks.]



http://www.unicodeconference.org



http://www.unicodeconference.org/conference-at-a-glance.htm 



Encodings

é    U+00E9 LATIN SMALL LETTER E WITH ACUTE

UTF-32 : FF FE 00 00 E9 00 00 00

UTF-16 : FF FE E9 00
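A minimal sketch of how these byte sequences can be reproduced in Python. The little-endian codecs are chosen explicitly and the byte order mark is prepended by hand, so the output does not depend on the machine; the variable names are only illustrative.

```python
import codecs

e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Explicit little-endian codecs do not emit a BOM, so it is prepended by hand
utf32 = codecs.BOM_UTF32_LE + e_acute.encode("utf-32-le")
utf16 = codecs.BOM_UTF16_LE + e_acute.encode("utf-16-le")
utf8 = e_acute.encode("utf-8")

for label, data in (("UTF-32", utf32), ("UTF-16", utf16), ("UTF-8", utf8)):
    print(label, ":", data.hex(" ").upper())
```

The same abstract code point yields 8, 4 and 2 bytes respectively (BOM included for the first two).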






UTF-32

Code space: 0x0000 to 0x10FFFF

Direct representation of the codepoint on 32 bits.

Disadvantage: 4 bytes per character is space inefficient.

Example with U+266A ♪ « EIGHTH NOTE »

00000000 00000000 00100110 01101010

0x00 0x00 0x26 0x6A








UTF-16

Code space: 0x0000 ... 0xD800 ... 0xE000 ... 0xFFFF (BMP, surrogates between 0xD800 and 0xE000), then 0x010000 ... 0x10FFFF

• Most common 63K characters encoded on single 16-bit code units.

• Example with U+266A ♪ « EIGHTH NOTE »

  00100110 01101010

  0x26 0x6A

• Other, non-BMP codepoints encode 20 bits in a pair of 16-bit surrogates.

• Example with U+1D11E 𝄞 « MUSICAL SYMBOL G CLEF »

  0x1D11E = 0001 1101 0001 0001 1110

• Subtract 0x10000 (for a 20-bit space), fill surrogates with 2 times 10 bits

  0x1D11E - 0x10000 = 0x0D11E = 0000110100 0100011110

  0xD8 0x00 + 0000110100 = 11011000 00110100
  0xDC 0x00 + 0100011110 = 11011101 00011110

  0xD8 0x34 0xDD 0x1E
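The surrogate arithmetic for U+1D11E can be sketched as follows. The function name to_surrogates is made up for this example; the result is cross-checked against Python's built-in big-endian UTF-16 codec.

```python
def to_surrogates(code_point):
    """Split a non-BMP code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000        # value in a 20-bit space
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1D11E)
print(hex(high), hex(low))

# Cross-check with the standard codec (big-endian, no BOM)
print("\U0001D11E".encode("utf-16-be").hex())
```

A BMP code point such as U+266A needs no such step: its single 16-bit code unit is the code point itself.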



UTF-8

• 7-bit codepoints (« Basic Latin »), ex. U+0041 A « LATIN CAPITAL LETTER A »

  0x0041 = 1000001
  0xxxxxxx = 01000001
  0x41

• 11-bit codepoints, i.e. blocks « Latin 1 », « Cyrillic », « Arabic », ...
  ex. U+03C6 φ « GREEK SMALL LETTER PHI »

  0x03C6 = 01111 000110
  110xxxxx 10xxxxxx = 11001111 10000110
  0xCF 0x86

• 16-bit codepoints, ex. U+266A ♪ « EIGHTH NOTE »

  0x266A = 0010 011001 101010
  1110xxxx 10xxxxxx 10xxxxxx = 11100010 10011001 10101010
  0xE2 0x99 0xAA

• 21-bit codepoints, ex. U+1D11E 𝄞 « MUSICAL SYMBOL G CLEF »

  0x1D11E = 000 011101 000100 011110
  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx = 11110000 10011101 10000100 10011110
  0xF0 0x9D 0x84 0x9E
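The four cases above can be sketched as a small hand-written encoder. This is an illustration only, with an invented function name; real code would simply call str.encode, which is used here to check the result.

```python
def utf8_encode(code_point):
    """Encode one code point as UTF-8 by filling the bit templates."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not encoded in UTF-8")
    if code_point < 0x80:        # 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:   # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of range")


for cp in (0x41, 0x3C6, 0x266A, 0x1D11E):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" ").upper())
```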






















Figure 2-12. Unicode Encoding Schemes

            A (U+0041)     Ω (U+03A9)     語 (U+8A9E)     𐎄 (U+10384)
UTF-32BE    00 00 00 41    00 00 03 A9    00 00 8A 9E    00 01 03 84
UTF-32LE    41 00 00 00    A9 03 00 00    9E 8A 00 00    84 03 01 00
UTF-16BE    00 41          03 A9          8A 9E          D8 00 DF 84
UTF-16LE    41 00          A9 03          9E 8A          00 D8 84 DF
UTF-8       41             CE A9          E8 AA 9E       F0 90 8E 84

in Unicode Standard 7.0, page 41



Normalization

TR#15 (UAX)

Canonical Equivalence

Two code point sequences with:

- same appearance

- same meaning

Å U+212B    is equivalent to    Å U+0041 U+030A

Compatibility Equivalence

Two code point sequences with:

- possibly distinct appearances

- the same meaning in some contexts

ﬁ (ligature)    is equivalent to    fi U+0066 U+0069

Example: é① U+00E9 U+2460

Form    Operation                                         Result
NFD     canonical decomposition                           é① U+0065 U+0301 U+2460
NFKD    compatibility decomposition                       é1 U+0065 U+0301 U+0031
NFC     canonical decomposition, then composition         é① U+00E9 U+2460 (most common)
NFKC    compatibility decomposition, then composition     é1 U+00E9 U+0031
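A short sketch of the four normalization forms applied to é followed by ①, using the standard unicodedata module. The helper code_points is an illustrative name introduced here to print a string as U+XXXX values.

```python
import unicodedata


def code_points(text):
    """Render a string as a space-separated list of U+XXXX values."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


sample = "\u00e9\u2460"  # é followed by CIRCLED DIGIT ONE

for form in ("NFD", "NFKD", "NFC", "NFKC"):
    print(form.ljust(4), code_points(unicodedata.normalize(form, sample)))

# ANGSTROM SIGN and A + COMBINING RING ABOVE are canonically equivalent
angstrom = unicodedata.normalize("NFD", "\u212b")
print(code_points(angstrom))
```

The compatibility forms (NFKD, NFKC) lose the circle around the digit, which is why they are usually reserved for search and matching rather than storage.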



NFC doesn't always compose

U+FB2C HEBREW LETTER SHIN WITH DAGESH AND SHIN DOT

NFC(U+FB2C) =

U+05E9 HEBREW LETTER SHIN
U+05BC HEBREW POINT DAGESH OR MAPIQ
U+05C1 HEBREW POINT SHIN DOT
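This behaviour can be checked with a few lines of Python; a minimal sketch, where the variable names are only illustrative. U+FB2C is one of the precomposed characters excluded from composition, so NFC leaves it decomposed.

```python
import unicodedata

shin = "\uFB2C"  # HEBREW LETTER SHIN WITH DAGESH AND SHIN DOT
nfc = unicodedata.normalize("NFC", shin)

# One code point in, three code points out, even though NFC "composes"
print(len(shin), len(nfc))
for ch in nfc:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

One practical consequence: NFC output can be longer than its input, so buffers sized from the original string length are not safe.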




NFKD Maximum Expansion 




U+FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM

>>> import unicodedata
>>> s = '\uFDFA'
>>> len(s)
1
>>> s_nfkd = unicodedata.normalize('NFKD', s)
>>> s_nfkd.encode('unicode-escape')
b'\\u0635\\u0644\\u0649 \\u0627\\u0644\\u0644\\u0647 \\u0639\\u0644\\u064a\\u0647 \\u0648\\u0633\\u0644\\u0645'
>>> len(s_nfkd)
18



Unicode Collation 

TR#10 (UTS) 

About text comparison 

cafe < café ?
café < cafe ?

Language dependent

Usage dependent

German dictionary: of < öf
German phonebook: öf < of



Algorithm (UCA)

Customizable

lower first or upper first, ...
numeric ordering, ...

Context dependent

Normal Accent Ordering
cote < coté < côte < côté
Backward Accent Ordering (FR)
cote < côte < coté < côté

Unstable over time 




Language Dependent Collation

German              Swedish
1  Åkersberga       2  Alingsås
2  Alingsås         4  Oskarshamn
3  Äpplebo          7  Utting
4  Oskarshamn       6  Üttfeld
5  Östersund        8  Zwickau
6  Üttfeld          1  Åkersberga
7  Utting           3  Äpplebo
8  Zwickau          5  Östersund

(Steven R. Loomis, Mark Davis)




DUCET (Default Unicode Collation Element Table) 



http://www.unicode.org/Public/UCA/latest/allkeys.txt

Character    Collation Element     Name
0300 "̀"      [.0000.0025.0002]     COMBINING GRAVE ACCENT
0061 "a"     [.190C.0020.0002]     LATIN SMALL LETTER A
0062 "b"     [.1925.0020.0002]     LATIN SMALL LETTER B
0063 "c"     [.193E.0020.0002]     LATIN SMALL LETTER C
0043 "C"     [.193E.0020.0008]     LATIN CAPITAL LETTER C
0064 "d"     [.1953.0020.0002]     LATIN SMALL LETTER D

The three weights of a collation element give, from left to right: alphabetic ordering, diacritic ordering, case ordering.




Algorithm 



NFD    Collation Element Array
cab    [.193E.0020.0002] [.190C.0020.0002] [.1925.0020.0002]
Cab    [.193E.0020.0008] [.190C.0020.0002] [.1925.0020.0002]
càb    [.193E.0020.0002] [.190C.0020.0002] [.0000.0025.0002] [.1925.0020.0002]
dab    [.1953.0020.0002] [.190C.0020.0002] [.1925.0020.0002]

NFD    Sort Key
cab    193E 190C 1925 0020 0020 0020 0002 0002 0002
Cab    193E 190C 1925 0020 0020 0020 0008 0002 0002
càb    193E 190C 1925 0020 0020 0025 0020 0002 0002 0002 0002
dab    1953 190C 1925 0020 0020 0020 0002 0002 0002
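The construction of these sort keys can be sketched in Python with a six-entry excerpt of the table. This is a toy under stated assumptions: TOY_DUCET and sort_key are names invented here, only the characters listed are supported, and the sketch inserts the 0000 level separator that UTS #10 prescribes between levels, which the keys listed above leave out.

```python
import unicodedata

# Toy excerpt of the DUCET: (primary, secondary, tertiary) weights
TOY_DUCET = {
    "\u0300": (0x0000, 0x0025, 0x0002),  # COMBINING GRAVE ACCENT
    "a": (0x190C, 0x0020, 0x0002),
    "b": (0x1925, 0x0020, 0x0002),
    "c": (0x193E, 0x0020, 0x0002),
    "C": (0x193E, 0x0020, 0x0008),
    "d": (0x1953, 0x0020, 0x0002),
}


def sort_key(text):
    """Build a three-level sort key: all primaries, then secondaries, then tertiaries."""
    elements = [TOY_DUCET[ch] for ch in unicodedata.normalize("NFD", text)]
    key = []
    for level in range(3):
        if level:
            key.append(0x0000)  # level separator
        # Zero weights are ignorable at that level and are skipped
        key.extend(e[level] for e in elements if e[level] != 0)
    return tuple(key)


words = ["dab", "c\u00e0b", "Cab", "cab"]
for word in sorted(words, key=sort_key):
    print(word, " ".join(f"{w:04X}" for w in sort_key(word)))
```

Because all primary weights come first, a case or accent difference only matters when the base letters are identical, which is what places "dab" after every variant of "cab".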





Case Folding 



# The data supports both implementations that require simple case foldings
# (where string lengths don't change), and implementations that allow full case folding
# (where string lengths may grow). Note that where they can be supported, the
# full case foldings are superior: for example, they allow "MASSE" and "Maße" to match.

00C9; C; 00E9; # LATIN CAPITAL LETTER E WITH ACUTE
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S

http://www.unicode.org/Public/UNIDATA/CaseFolding.txt

http://userguide.icu-project.org/transforms/casemappings
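In Python 3, full case folding is available as str.casefold. A minimal sketch of the "MASSE" / "Maße" example, where caseless_equal is an illustrative helper name:

```python
def caseless_equal(a, b):
    """Compare two strings using full case folding."""
    return a.casefold() == b.casefold()


print(caseless_equal("MASSE", "Ma\u00dfe"))       # full folding: sharp s folds to "ss"
print("MASSE".lower() == "Ma\u00dfe".lower())     # lowercasing keeps the sharp s
print(len("Ma\u00dfe"), len("Ma\u00dfe".casefold()))  # the folded string is longer
```

For comparisons that should also ignore canonical differences, folding is usually combined with normalization.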




Case Conversion

Figure 5-16. Casing of German Sharp S

[Figure: mappings between ß and ss under Default Casing and under Tailored Casing.]




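Case Conversion

[Figure: upper and lower case mappings among I (U+0049), ı (U+0131), İ (U+0130), i (U+0069) and the sequences using U+0307 COMBINING DOT ABOVE (U+0049 U+0307, U+0130 U+0307, U+0069 U+0307), compared between the POSIX locale and the Turkish locale.]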



Emojis

絵 (e = picture)
文 (mo = writing)
字 (ji = character)

Early 2000s: emoji became generally available on Japanese cell phones.

Late 2000s: standardized, then added to Unicode 6.0 (2010).

Submit your own: http://www.unicode.org/pending/proposals.html and join the rejected ones: http://www.unicode.org/alloc/nonapprovals.html



Emoji Symbols: Background Data 

Background data for Proposal for Encoding Emoji Symbols 

N3xxx 

Date: 2010-Apr-27 
Authors: 

Markus Scherer, Mark Davis, Kat Momoi, Darick Tong (Google Inc.) 

Yasuo Kida, Peter Edberg (Apple Inc.) 

This document reflects proposed Emoji symbols data as shown in FDAM8 which includes the disposition of FPDAM8 ballot comments and changes agreed during the San José WG2 meeting 56.

The carrier symbol images in this file point to images on other sites. The images are only for comparison and may change. 

See the chart legend for an explanation of the data presentation in this chart. 

In the HTML version of this document, each symbol row has an anchor to allow direct linking by appending #e-4B0 (for example) to this page’s URL in the address 
bar. 



Columns: Internal ID, Symbol, Name & Annotations, DoCoMo, KDDI, SoftBank, Google

Enclosed alphanumeric symbols

e-82C  HASH KEY  U+0023 U+20E3, unified (Unicode 3.0)
  DoCoMo:    #123 'Sharp dial'  U+E6E0  SJIS-F985  JIS-7B69
  KDDI:      #818  U+EB84  SJIS-F489  JIS-7B69
  SoftBank:  #403 #old196  U+E210  SJIS-F7B0
  Google:    U+FE82C

e-837  KEYCAP 0  U+0030 U+20E3, unified (Unicode 3.0)
  DoCoMo:    #134  U+E6EB  SJIS-F990  JIS-784B
  KDDI:      #325  U+E5AC  SJIS-F7C9  JIS-784B
  SoftBank:  #402 #old217  U+E225  SJIS-F7C5
  Google:    U+FE837

e-82E  KEYCAP 1  U+0031 U+20E3, unified (Unicode 3.0)
  DoCoMo:    #125  U+E6E2  SJIS-F987  JIS-767D
  KDDI:      #180  U+E522  SJIS-F6FB  JIS-767D
  SoftBank:  #393 #old208  U+E21C  SJIS-F7BC
  Google:    U+FE82E

e-82F  KEYCAP 2  U+0032
  DoCoMo:    #126  U+E6E3  SJIS-F988  JIS-767E
  KDDI:      #181  U+E523  SJIS-F6FC  JIS-767E
  SoftBank:  #394 #old209  U+E21D  SJIS-F7BD
  Google:    U+FE82F





Awful Support in Chrome

[Figure: side-by-side screenshots of seriot.ch/visualizati...s/plane_00_emojis.txt in Safari and in Chrome, listing 0x2708, 0x2709, 0x270a, 0x270b, 0x270c, 0x270f, 0x2728, 0x2733, 0x2734, 0x2744, 0x2747, 0x274c and 0x274e. Safari draws a glyph for each; Chrome shows empty boxes (□) for several of them, e.g. 0x270a, 0x2728, 0x274c and 0x274e.]



Emojis Evolution 



Discussions about emoji diversity in meeting minutes

http://www.unicode.org/L2/L2014/14172r-emoji-enhancements.pdf

http://www.unicode.org/L2/L2014/14177.htm#140-C28

UTC Meeting T140-A471 Action Item for Mark Davis: Talk to Facebook and 
Twitter to see if they would like to get more involved. 




Variation Selectors



may modify some glyph appearance
16 VS in BMP: U+FE00 to U+FE0F
240 more VS in plane 14
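A small Python sketch of what a variation sequence looks like at the code point level. The choice of U+2764 as the base character and the variable names are illustrative; whether the two sequences are actually drawn differently depends on the font and the rendering engine.

```python
import unicodedata

VS15 = "\uFE0E"  # requests text presentation
VS16 = "\uFE0F"  # requests emoji presentation

heart = "\u2764"  # HEAVY BLACK HEART
text_style = heart + VS15
emoji_style = heart + VS16

# A variation sequence is still two code points, base character first
print([f"U+{ord(ch):04X}" for ch in text_style])
print([f"U+{ord(ch):04X}" for ch in emoji_style])

# The 16 selectors of the BMP and the 240 selectors of plane 14
bmp_selectors = [chr(cp) for cp in range(0xFE00, 0xFE10)]
plane14_selectors = [chr(cp) for cp in range(0xE0100, 0xE01F0)]
print(len(bmp_selectors), len(plane14_selectors))
print(unicodedata.name(VS15), "/", unicodedata.name(VS16))
```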







[Figure: BMP emoji variations with VS15 and VS16. The chart is laid out in four column groups headed U+FE0E and U+FE0F. For each code point, from U+203C, U+2049, U+2139 and U+2194 through the U+26xx and U+27xx symbols to U+2B55, U+303D, U+3297 and U+3299, it shows the glyph alone, then followed by U+FE0E (VS15) and by U+FE0F (VS16).]



Figure 1: Typical Cyrillic Small Letter Ve (boxed in black) and variant form (boxed in red). Source: Bible printed by Francysk Skaryna, Prague, circa 1519.

Figure 2: Typical Cyrillic Small Letter Ve (boxed in black) and variant form (boxed in red). Source: Bible printed by Francysk Skaryna, Prague, circa 1519.

[Scanned text samples for both figures are not reproduced.]

Table 1: Table of Proposed Variation Sequences

(Columns in the original: Sequence, Glyph, Name, HIP, Belgrade. The Glyph and HIP columns did not survive extraction.)

Sequence         Name                                                                     Belgrade
U+0432 U+FE00    CYRILLIC SMALL LETTER VE VARIANT-1 ROUNDED VEDI                          (not legible)
U+0432 U+FE0F    CYRILLIC SMALL LETTER VE BASE FORM                                       E053
U+0434 U+FE00    CYRILLIC SMALL LETTER DE VARIANT-1 LONG-LEGGED DOBRO                     -
U+0434 U+FE0F    CYRILLIC SMALL LETTER DE BASE FORM                                       E055
U+043E U+FE00    CYRILLIC SMALL LETTER O VARIANT-1 NARROW ON                              E069
U+043E U+FE0F    CYRILLIC SMALL LETTER O BASE FORM                                        E06A
U+0441 U+FE00    CYRILLIC SMALL LETTER ES VARIANT-1 WIDE SLOVO                            E167
U+0441 U+FE0F    CYRILLIC SMALL LETTER ES BASE FORM                                       E06F
U+0442 U+FE00    CYRILLIC SMALL LETTER TE VARIANT-1 TALL TVERDO                           -
U+0442 U+FE01    CYRILLIC SMALL LETTER TE VARIANT-2 OLD-STYLE TVERDO                      -
U+0442 U+FE0F    CYRILLIC SMALL LETTER TE BASE FORM                                       E070
U+044A U+FE00    CYRILLIC SMALL LETTER HARD SIGN VARIANT-1 TALL HARD SIGN                 -
U+044A U+FE0F    CYRILLIC SMALL LETTER HARD SIGN BASE FORM                                E080
U+0463 U+FE00    CYRILLIC SMALL LETTER YAT VARIANT-1 TALL YAT                             -
U+0463 U+FE0F    CYRILLIC SMALL LETTER YAT BASE FORM                                      E086
U+A64B U+FE00    CYRILLIC SMALL LETTER MONOGRAPH UK VARIANT-1 CHECKMARK-SHAPED UK         -
U+A64B U+FE0F    CYRILLIC SMALL LETTER MONOGRAPH UK BASE FORM                             E072



Proposal to Use Standardized Variation Sequences to 
Encode Church Slavonic Glyph Variants in Unicode 





Country Flags 



0x1f1e6 + 0x1f1e7    A B    (shown as two boxed letters)
0x1f1e8 + 0x1f1f3    C N    (shown as two boxed letters)
0x1f1e9 + 0x1f1ea    D E    flag
0x1f1ea + 0x1f1f8    E S    flag
0x1f1eb + 0x1f1f7    F R    flag
0x1f1ec + 0x1f1e7    G B    flag
0x1f1ee + 0x1f1f9    I T    flag
0x1f1ef + 0x1f1f5    J P    flag
0x1f1f0 + 0x1f1f7    K R    flag
0x1f1f7 + 0x1f1fa    R U    flag
0x1f1fa + 0x1f1f8    U S    flag
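The pairing shown above follows a simple rule: each letter of a two-letter country code is mapped to the regional indicator symbol at the same offset from U+1F1E6. A minimal Python sketch, where the function name flag and the constant name are invented for the example:

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag(country_code):
    """Map a two-letter code such as 'FR' to its pair of regional indicators."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected two ASCII letters")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


for code in ("FR", "JP", "US"):
    print(code, [hex(ord(ch)) for ch in flag(code)])
```

Unicode only encodes the 26 indicator letters; whether a given pair is drawn as a flag or as two boxed letters is up to the font.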



Unicode Common Locale Data Repository (CLDR) TR#35 (UTS) 

• Locale-specific patterns for formatting and parsing 

dates, times, timezones, numbers and currency values 

• Translations of names 

countries and regions, currencies, eras, months, 
weekdays, timezones, cities, time units, ... 



• Language & script information

characters used; sorting & searching; writing direction; 
numbers spellings; segmentation, ... 



• Country information 

language usage, currency information, calendar 
preference and week conventions, ... 



[Figure: screenshot of the "CLDR Releases/Downloads" page at cldr.unicode.org/index/downloads. The page states that each release of the Unicode CLDR is a stable release and may be used as reference material or cited as a normative reference by other specifications, and that each version, once published, is absolutely stable and never changes. A table then lists, for the latest version, the development snapshot and each release from 26 down to 1.8.1: date, release note, data, charts, spec (LDML), delta, SVN tag and DTD diffs.]



International Components for Unicode (ICU)



Open-source project on top of CLDR 

Unicode text handling and regular expressions 

character, word, and line boundaries 

Language sensitive collation and searching 

Normalization, upper and lowercase conversion 

multi-calendar and time zones 

parse and format dates, times, numbers, currencies 




Descends from Taligent (mid 1990s), which became part of IBM in 1996 



Included by Sun into JDK 1.1 




More Specifications 



• Text Segmentation TR#29 (UAX) 

• About where to break words and lines, contextual

• Regular Expressions TR#18 (UTS) 

• Bidirectional Algorithm TR#9 (UAX) 

• Arabic, Hebrew, ... display text from right to left but use left to right digits 




3. Unicode in Practice 



OS X Unicode Hex Input 




[Figure: OS X input menu listing the « Swiss French » and « U+ Unicode Hex Input » input sources, followed by Show Character Viewer, Show Keyboard Viewer, Show Input Source Name and Open Keyboard Preferences...]



$ python3
>>> u = '\U0001F41B'
>>> print(u)
🐛
>>> import unicodedata
>>> unicodedata.name(u)
'BUG'
>>> u2 = unicodedata.lookup("BUG")
>>> print(u2)
🐛





Code Points <-> Bytes 



u"abc\u27A2"    == encode (UTF-8) ==>    'abc\xe2\x9e\xa2'

u"abc\u27A2"    <== decode (UTF-8) ==    'abc\xe2\x9e\xa2'
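The same round trip in Python 3, as a minimal sketch. In Python 3, str holds code points and bytes holds the encoded form, so the u prefix is no longer needed; the variable names are illustrative.

```python
text = "abc\u27A2"            # 4 code points
data = text.encode("utf-8")   # 6 bytes: U+27A2 takes three of them

print(len(text), len(data))
print(data)

# Decoding with the same encoding restores the original string
print(data.decode("utf-8") == text)

# Decoding with the wrong encoding silently produces different text
print(data.decode("latin-1") == text)
```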







C/C++ 

Use wchar_t* ("wide char") instead of char* 

Use the wcs functions instead of the str functions 
strcat => wcscat 
strlen => wcslen 

Convert char strings into wchar_t strings 
mbstowcs multi byte string to wide char string 
wcstombs wide char string to multi byte string 



Create a literal UCS-2 string: 
L"Hello" 







#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main(void) {
    /* Take the locale from the environment so wide characters are converted on output */
    if (!setlocale(LC_CTYPE, "")) {
        fprintf(stderr, "Can't set the specified locale!\n");
        return 1;
    }

    wchar_t wc = 0x2190; /* LEFTWARDS ARROW */

    printf("%ls %lc\n", L"Schöne Grüße \u2603", (wint_t)wc);
    return 0;
}



$ export LC_CTYPE=UTF-8
$ cc utf8.c
$ ./a.out
Schöne Grüße ☃ ←



length of wchar_t (16 or 32 bits) is implementation-defined 




class Test {
    public static void main(String[] argv) {
        String s = "xxx \u2603";
        System.out.println(s);
    }
}



$ javac Test.java
$ java -Dfile.encoding=UTF-8 Test
xxx ☃



wide characters size is defined as 16 bits 



Encoding Conversions 



$ file utf8.txt
utf8.txt: UTF-8 Unicode text

$ iconv -f utf8 -t utf-16le utf8.txt > utf-16le.txt

$ file latin1.txt
latin1.txt: ISO-8859 text
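The same conversion can be sketched in memory: decode the bytes from the source encoding, then encode to the target one. This is only an illustration of what a transcoder such as iconv does; the sample string is an assumption.

```python
# Transcode UTF-8 bytes to UTF-16LE by going through code points.
sample = "Grüße ☃"                      # 7 code points
utf8_bytes = sample.encode("utf-8")
utf16le_bytes = utf8_bytes.decode("utf-8").encode("utf-16-le")

print(len(utf8_bytes), len(utf16le_bytes))
```

The byte counts differ because UTF-8 uses 1 to 3 bytes for these characters while UTF-16 uses 2 bytes for each of them.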



Objective-C 

NSString *s0 = @"A";
NSString *s1 = @"\x61";
NSString *s2 = @"\u2100";
NSString *s3 = @"\U0001FF00";

NSString *s1 = @"\u2603";
unichar uc = 0x2665;
NSLog(@"-- s1: %@ %C", s1, uc); // ☃ ♥

NSString *s2 = [NSString stringWithUTF8String:"\xF0\x9D\x84\x9E"];
NSLog(@"-- s2: %@", s2); // 𝄞

NSData *data = [s2 dataUsingEncoding:NSUTF8StringEncoding];
NSLog(@"-- data: %@", data); // <f09d849e>



Python

✗ Collation: still compares code points

✗ Case conversion restricted to 1:1 case mappings

>>> 'ß'.upper()
'ß'

✗ Case conversion ignores locale
✗ Additionally, locale is global

>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'tr_TR')
>>> s = "istanbul"
>>> s.upper()
'ISTANBUL'
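The collation point can be seen with a plain sort. A minimal sketch, with an illustrative word list: the built-in `sorted` compares strings code point by code point, so uppercase letters and accented letters end up far from where a dictionary would put them.

```python
# Default string ordering is by code point, not by any language's alphabet.
words = ["zebra", "école", "eclair", "Zoo"]
by_code_point = sorted(words)

# 'Z' (U+005A) < 'e' (U+0065) < 'z' (U+007A) < 'é' (U+00E9)
print(by_code_point)
```

A locale-aware order needs a collation key; `locale.strxfrm` is one standard-library option, though its result depends on the locales installed on the system.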



Case Conversion - Locale

NSString *s = [NSString stringWithFormat:@"istambul"];
NSLocale *locale = [NSLocale localeWithLocaleIdentifier:@"tr_TR"];
NSString *s2 = [s uppercaseStringWithLocale:locale];

// İSTAMBUL



// U+1F600 GRINNING FACE
NSArray *a = @[@"A", @"\U0001F600", @"B"];

[a enumerateObjectsUsingBlock:^(NSString *s, NSUInteger idx, BOOL *stop) {
    NSLog(@"[%lu] %@\n", idx, s);
}];

/*
[0] A
[1] 😀
[2] B
*/

[a enumerateObjectsUsingBlock:^(NSString *s, NSUInteger idx, BOOL *stop) {
    NSLog(@"[%lu] %C\n", idx, [s characterAtIndex:0]);
    // idx == 1, s = [0xD83D, 0xDE00], and U+D83D is a high surrogate
}];

/*
[0] A
[2] B
*/
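The surrogate pair mentioned in the comment can be reproduced outside Objective-C. The sketch below uses a hypothetical helper, `utf16_code_units`, to list the 16-bit code units that UTF-16 uses for a string.

```python
import struct

def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    raw = s.encode("utf-16-be")                       # big-endian, no BOM
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))

units = utf16_code_units("\U0001F600")
print([hex(u) for u in units])    # a high surrogate, then a low surrogate
```

Reading only the first unit, as `characterAtIndex:0` does, yields half of a pair, which is not a character on its own.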







$ xcrun swift
  1> import Foundation
  2> var s1 = "ni\u{00F1}o" // precomposed
s1: String = "niño"
  3> var s2 = "nin\u{0303}o" // decomposed
s2: String = "niño"
  4> s1 == s2 // canonical equality
$R0: Bool = true
  5> s1.isEqual(s2) // different bytes
$R1: Bool = false
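For comparison, Python's `==` behaves like the byte-level check here, so canonical equality has to be requested explicitly. A minimal sketch using `unicodedata.normalize`:

```python
import unicodedata

s1 = "ni\u00F1o"     # precomposed: U+00F1
s2 = "nin\u0303o"    # decomposed: 'n' followed by U+0303 COMBINING TILDE

print(s1 == s2)                                    # code point comparison
print(unicodedata.normalize("NFC", s2) == s1)      # compare in composed form
print(unicodedata.normalize("NFD", s1) == s2)      # compare in decomposed form
```

Normalizing both sides to the same form (NFC or NFD) before comparing is the usual way to get canonical equivalence.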




Regex 



$ python3
>>> import re
>>> reg = re.compile(r"\d")
>>> gen = (chr(c) for c in range(0, 0xFFFF) if re.match(reg, chr(c)))
>>> print(''.join(gen))
0123456789 [followed by the decimal digits of many other scripts, ending with the fullwidth digits ０１２３４５６７８９]

>>> reg = re.compile(r"\d", re.ASCII)
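The effect of the `re.ASCII` flag can be checked by counting matches over the Basic Multilingual Plane. This is a small sketch; the variable names are illustrative.

```python
import re

candidates = [chr(c) for c in range(0x10000)]

# Without flags, \d matches every Unicode decimal digit.
unicode_digits = [ch for ch in candidates if re.fullmatch(r"\d", ch)]

# With re.ASCII, \d is restricted to [0-9].
ascii_digits = [ch for ch in candidates if re.fullmatch(r"\d", ch, re.ASCII)]

print(len(unicode_digits), "".join(ascii_digits))
```

If a pattern validates input that is later parsed as a number, the ASCII-only variant (or an explicit `[0-9]`) is usually what is meant.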









How well do you know your tools? 



• illegal code points 

• length? (code points? bytes?) 

• equality, equivalence, normalization 

• reversing strings 

• character at index 



• iterating over all symbols 

• substring 

• regex 

• bi-directional text 

• text segmentation 
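Two items from this checklist, length and reversing, can be probed in a few lines. A sketch in Python, using a decomposed "niño" as an assumed test string:

```python
s = "nin\u0303o"                          # "niño" with a combining tilde

code_points = len(s)                      # Python counts code points
utf8_bytes = len(s.encode("utf-8"))       # U+0303 takes two bytes
utf16_units = len(s.encode("utf-16-le")) // 2

# Naive reversal moves the combining tilde onto the wrong letter:
# the result is "o" + U+0303 + "nin", which renders as "õnin".
reversed_s = s[::-1]

print(code_points, utf8_bytes, utf16_units, reversed_s)
```

None of the three lengths is the number of characters a reader sees, which is four here.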




4. Unicode Hacks 




April Arcus (@aprilarcus):

Yo Unicode I herd U like typefaces so we put some codepoints in your Supplementary [...] encode [...] your fonts.



Wall Street Journal (@wsj):

Last 12 months [...] of the U.S. unemployment rate, which rose to 9% in April. More data: 
http://on.wsj.comF 







Draw something in the left box! 

And let shapecatcher help you to find the most similar Unicode 
characters! 

Currently, there are 11817 Unicode character glyphs in the database. 
Japanese, Korean and Chinese characters are currently not 
supported. 







Domino tile horizontal-01-01: 🀹
Unicode hexadecimal: 0x1f039
In block: Domino Tiles
score: 0.843336




[Screenshot: a chess game played over text messages, with the board drawn using Unicode chess symbols. The first message reads: "New game started! You're playing white. Make a move (e.g. e2e4)", and the reply is "e2e4".]





Nicolas Seriot (@nst021), 8:29 PM - 17 Jan 2013:

$ python unibinary.py -s "[...]"
$ chmod +x m
$ ./m
Hello world





Pack 289+ ASCII chars or 209+ bytes into 140 characters. 

https://github.com/nst/UniBinary 





Unicode Security 



« Unicode is just too complex to ever be secure. » 
- Bruce Schneier, 2000 
https://www.schneier.com/crypto-gram-0007.html#9 

• TR#36 Unicode Security Considerations 

• TR#39 Unicode Security Mechanisms 

Chris Weber's http://websec.github.io/unicode-security-guide/ 



Illegal Sequences 



Illegal UTF-8 sequences include:

- overlong encoding               11000000 10000001    0xC0 0x81
- unexpected continuation byte    11000000 00000000    0xC0 0x00
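A strict decoder should reject both byte pairs above. The sketch below wraps Python's built-in UTF-8 decoder in a hypothetical helper, `is_valid_utf8`, and adds a lone continuation byte and a well-formed sequence for contrast.

```python
def is_valid_utf8(data):
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")      # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"\xc0\x81"))       # overlong encoding of U+0001
print(is_valid_utf8(b"\xc0\x00"))       # lead byte without a continuation byte
print(is_valid_utf8(b"\x80"))           # continuation byte with no lead byte
print(is_valid_utf8(b"\xe2\x98\x83"))   # U+2603 SNOWMAN, well formed
```

Validating at the boundary matters because a lenient decoder further down the line may map an overlong sequence to a character that a filter has already ruled out.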



Illegal UTF-16 sequences include unpaired surrogates such as: 

- [0xD800-0xDBFF] not followed by [0xDC00-0xDFFF] 

- [0xDC00-0xDFFF] not preceded by [0xD800-0xDBFF] 
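These two rules translate into a short scan over 16-bit code units. The function name `has_unpaired_surrogate` is hypothetical; this is a sketch of the check, not a full UTF-16 validator.

```python
def has_unpaired_surrogate(units):
    """units: sequence of 16-bit code units (integers)."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate reached here was not preceded by a high one.
            return True
        i += 1
    return False

print(has_unpaired_surrogate([0xD83D, 0xDE00]))   # well-formed pair
print(has_unpaired_surrogate([0xDE00, 0xD83D]))   # wrong order
```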





Exploiting Transformations 

Exploitation of normalization to add / remove characters and bypass filters 

Non-characters: U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+10FFFE, U+10FFFF 

Non-character code points must not be simply deleted (as allowed by Unicode 
< 5.2 C7) but replaced by U+FFFD REPLACEMENT CHARACTER. 

<a href="java\uFEFFscript:alert('XSS')"> 

Unassigned code points (eg. U+2073) 
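Why deletion is dangerous can be shown with the `java\uFEFFscript:` example. In this sketch the filter and both clean-up stages are hypothetical functions, made up to illustrate the ordering problem.

```python
def naive_is_safe(url):
    """Hypothetical filter: reject URLs that contain 'javascript:'."""
    return "javascript:" not in url.lower()

def delete_ignorable(s):
    """Hypothetical later stage that silently drops U+FEFF."""
    return s.replace("\uFEFF", "")

def replace_ignorable(s):
    """Safer variant: substitute U+FFFD instead of deleting."""
    return s.replace("\uFEFF", "\uFFFD")

payload = "java\uFEFFscript:alert('XSS')"

print(naive_is_safe(payload))                      # the filter sees nothing wrong
print(naive_is_safe(delete_ignorable(payload)))    # deletion rebuilds the keyword
print(naive_is_safe(replace_ignorable(payload)))   # replacement keeps it broken
```

The same reasoning applies to any transformation that runs after a filter: normalization, case folding, or removal of characters can each reassemble a string the filter was meant to stop.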



Spotify Labs 

Creative usernames and Spotify account hijacking 
Posted on June 18, 2013 by Mikael Goldmann 



A bunch of us dropped whatever we were working on and scurried to try to understand 
what was going wrong and how to fix it. From the forum post we knew that taking over an 
account went something like this: 

1. Find a user account to hijack. For the sake of this example let us hijack the account 
belonging to user bigbird. 

2. Create a new spotify account with username ᴮᴵᴳᴮᴵᴿᴰ (in python this is the string 
u'\u1d2e\u1d35\u1d33\u1d2e\u1d35\u1d3f\u1d30'). 

3. Send a request for a password reset for your new account. 

4. A password reset link is sent to the email you registered for your new account. Use it 
to change the password. 

5. Now, instead of logging in to account with username ᴮᴵᴳᴮᴵᴿᴰ, try logging in to account 
with username bigbird with the new password. 

6. Success! Mission accomplished. 

From the log lines associated with the hijacking of the forum manager’s account it 
appeared to be a problem with how we derived a canonical username from the username 
the user chooses at registration, but we were still pretty much in the dark. We had no 
option except to disable account creation until we could prevent the attack. 



https://labs.spotify.com/2013/06/18/creative-usernames/ 
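One way a canonical-username function can go wrong is by not being idempotent, so that applying it twice gives a different result than applying it once. The `canonical` function below is a hypothetical illustration built from lowercasing and NFKC normalization; it is not Spotify's actual code.

```python
import unicodedata

def canonical(name):
    """Hypothetical canonicalization: lowercase first, then NFKC."""
    return unicodedata.normalize("NFKC", name.lower())

name = "\u1d2e\u1d35\u1d33\u1d2e\u1d35\u1d3f\u1d30"   # modifier letters

once = canonical(name)     # lower() leaves the modifier letters alone,
                           # then NFKC maps them to plain capitals
twice = canonical(once)    # now lower() does apply

print(once, twice)
```

If one code path canonicalizes once and another twice, the two accounts collide. Checking that `canonical(canonical(x)) == canonical(x)` for every accepted name is a cheap guard against this class of bug.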




Visual Spoofing 



[Row of visually similar "A" glyphs] 

www.google.com - U+0067 LATIN SMALL LETTER G 
www.ɡooɡle.com - U+0261 LATIN SMALL LETTER SCRIPT G 

৪ - U+09EA BENGALI DIGIT FOUR 

୨ - U+0B68 ORIYA DIGIT TWO 
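Look-alike strings are easy to tell apart once the code points are listed by name. A minimal sketch with a hypothetical helper, `non_ascii_report`:

```python
import unicodedata

real = "www.google.com"
fake = "www.\u0261oo\u0261le.com"     # U+0261 in place of each 'g'

def non_ascii_report(s):
    """List (code point, name) for every non-ASCII character in s."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch))
            for ch in s if ord(ch) > 0x7F]

print(real == fake)
print(non_ascii_report(fake))
```

Flagging non-ASCII characters is only a first step; it does not help for scripts where non-ASCII is the norm, which is what the mechanisms in TR#39 address.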








Nicolas Seriot (@nst021), 10:49 AM - 25 Mar 2013:

Here is a nice little Core Text crasher for OS X:
$ python -c "print u'\u0647\u0020\u0488\u0488\u0488'"

$ gdb Twitter
(gdb) r
Starting program: /Applications/Twitter.app/Contents/MacOS/Twitter

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x00000001084e8008
0x00007fff9432ead2 in vDSP_sveD ()

(gdb) bt
#0  0x00007fff9432ead2 in vDSP_sveD ()
#1  0x00007fff934594fe in TStorageRange::SetStorageSubRange
#2  0x00007fff93457d5c in TRun::TRun ()
#3  0x00007fff934579ee in CTGlyphRun::CloneRange ()
#4  0x00007fff93466764 in TLine::SetLevelRange ()
#5  0x00007fff93467e2c in TLine::SetTrailingWhitespaceLevel
#6  0x00007fff93467d58 in TRunReorder::ReorderRuns ()
#7  0x00007fff93467bfe in TTypesetter::FinishLineFill ()
#8  0x00007fff934858ae in TFramesetter::FrameInRect ()
#9  0x00007fff93485110 in TFramesetter::CreateFrame ()
#10 0x00007fff93484af2 in CTFramesetterCreateFrame ()


ars technica

Rendering bug crashes OS X, iOS apps with string of Arabic characters (Updated)
CoreText bug crashes any iOS 6 and OS X programs that use the API.
by Andrew Cunningham and Dan Goodin, Aug 29 2013, 9:30pm CEST

Anatomy of a killer bug: How just 5 characters can murder iPhone, Mac apps
What evil lurks in the Unicode of Death ... oh, a buffer overrun
By Chris Williams, 4 Sep 2013

Analysis There has been much sniggering into sleeves after wags found they could 
upset iOS 6 iPhones and iPads, and Macs running OS X 10.8, by sending a simple 
rogue text message or email. 

A bug is triggered when the CoreText component in vulnerable Apple operating [...]










U+202E RIGHT-TO-LEFT OVERRIDE 



$ python3 -c "print('ABC\u202EDEF')"
ABCFED
# copy-paste gets crazy

$ python3 -c "print('x\u202Efdp.doc')"
xcod.pdf
# double click a .pdf, open a .doc
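A file name that hides an override can be detected by looking at the bidirectional class of each character. This is a sketch with a hypothetical helper, `has_bidi_control`; the set of classes covers the explicit embedding, override and isolate controls.

```python
import unicodedata

# Bidi classes of the explicit directional formatting characters.
BIDI_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def has_bidi_control(name):
    """Return True if name contains an explicit bidi control character."""
    return any(unicodedata.bidirectional(ch) in BIDI_CONTROLS for ch in name)

print(has_bidi_control("x\u202Efdp.doc"))   # displays as "xcod.pdf"
print(has_bidi_control("xcod.pdf"))         # the honest file name
```

Rejecting or visibly escaping such characters in file names and identifiers is one possible policy; legitimate right-to-left text rarely needs an explicit override in a file name.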





HFS+ 



Important: The terms used in this Q&A, precomposed and decomposed, roughly 
correspond to Unicode Normal Forms C and D, respectively. However, most volume 
formats do not follow the exact specification for these normal forms. For example, HFS 
Plus (Mac OS Extended) uses a variant of Normal Form D in which U+2000 through 
U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF are not decomposed 
(this avoids problems with round trip conversions from old Mac text encodings). It’s likely 
that your volume format has similar oddities. 



Apple Technical Q&A QA1173 



• Terminal.app (and most apps) output NFC UTF-8. 

• The filenames you write are different from the ones you read. 



HFS+ 



$ echo ü; echo ü | xxd
ü
0000000: c3bc 0a                # NFC

$ touch ü; ls; ls | xxd
ü
0000000: 75cc 880a              # NFD

$ touch "Bücher"
$ ls Bü<TAB>    # no completion
$ ls Bu<TAB>    # completion




OS X Bash 



$ mkdir /tmp/test 
$ cd /tmp/test 

$ touch "`printf "a\xef\xbb\xbfb"`"
# or "a\uFEFFb".encode('utf-8')

$ ls a*
a?b

$ touch ab
$ ls a*
a?b

# where did ab go?!




OS X Finder 



$ echo -e "\xFF\xFE" > x.txt # UTF-16LE BOM
$ xattr -w com.apple.TextEncoding "utf-16le" x.txt
$ qlmanage -p x.txt # or QuickLook with Finder

[ERROR] An uncaught exception was raised outside of any generator: *** -[NSConcreteTextStorage attribute:atIndex:longestEffectiveRange:inRange:]: Range or index out of bounds
2014-10-24 10:53:08.474 qlmanage[5268:11f] *** Terminating app due to uncaught exception 'NSRangeException', reason: '*** -[NSConcreteTextStorage attribute:atIndex:longestEffectiveRange:inRange:]: Range or index out of bounds'
*** First throw call stack:
0  CoreFoundation   0x00007fff89ebe25c __exceptionPreprocess + 172
1  libobjc.A.dylib  0x00007fff87934e75 objc_exception_throw + 43
2  CoreFoundation   0x00007fff89ebe10c +[NSException raise:format:] + 204
3  AppKit           0x00007fff81a83a7a -[NSConcreteTextStorage attribute:atIndex:longestEffectiveRange:inRange:] + 118
4  AppKit           0x00007fff81951ded -[NSMutableAttributedString(NSMutableAttributedStringKitAdditions) fixGlyphInfoAttributeInRange:] + 204
5  AppKit           0x00007fff81951cd8 -[NSMutableAttributedString(NSMutableAttributedStringKitAdditions) fixAttributesInRange:] + 39
6  AppKit           0x00007fff81a838e1 -[NSTextStorage processEditing] + 109
7  AppKit           0x00007fff81a7f742 -[NSTextStorage endEditing] + 110
8  AppKit           0x00007fff81c5db4f _NSReadAttributedStringFromURLOrData + 14525
9  AppKit           0x00007fff81c5e3a5 -[NSAttributedString(NSAttributedStringKitAdditions) initWithURL:options:documentAttributes:

# watch your Finder go nuts!!!
$ cd; touch "`printf "\x41\xe9"`"
# NFC ("Aé")
$ open .
# fixed in OS X 10.10




Conclusion 



• Unicode is cool. Unicode is hard. 



• Everything dealing with Unicode is a bug nest. 

• You cannot just ignore Unicode, you're using it. 



• Most APIs should use strings instead of a single char. 
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